{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 - Apache Spark ML - Create train and test set\n",
    "\n",
    "In this chapter, you will:\n",
    "\n",
    "• Create a test and train set\n",
    "\n",
    "• Learn more Spark functionality and how to use it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark.read.csv(\"testing_bot_data.csv\", header= True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession \n",
    "\n",
    "spark = SparkSession.builder \\\n",
    "    .master('local[*]') \\\n",
    "    .appName(\"ApacheSparkML\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After filtering and working on the train DataFrame, we need to make sure the test set has the same structure.\n",
    "\n",
    "Load testing data from CSV file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test = spark.read.csv(\"testing_bot_data.csv\", header= True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean and prep testing data as well:\n",
    "Remember that here we don't have bot value.\n",
    "\n",
    "You will not drop id, as we will use it to compare results later.\n",
    "\n",
    "Excecute the next commands:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql.types import IntegerType, ArrayType, BooleanType, StringType\n",
    "from pyspark.sql.functions import udf\n",
    "from pyspark.sql.functions import when\n",
    "\n",
    "# Dropping irrelevant columns and duplicates\n",
    "df_test = df_test.drop('default_profile_image','has_extended_profile','url','created_at','lang')\n",
    "df_test = df_test.dropDuplicates()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# First Transformation\n",
    "df_test = df_test.withColumn(\"friends_count\", df_test[\"friends_count\"].cast(IntegerType()))\n",
    "df_test = df_test.withColumn(\"listed_count\", df_test[\"listed_count\"].cast(IntegerType()))\n",
    "df_test = df_test.withColumn(\"favourites_count\", df_test[\"favourites_count\"].cast(IntegerType()))\n",
    "df_test = df_test.withColumn(\"statuses_count\", df_test[\"statuses_count\"].cast(IntegerType()))\n",
    "df_test = df_test.withColumn(\"verified\", df_test[\"verified\"].cast(BooleanType()))\n",
    "df_test = df_test.withColumn(\"default_profile\", df_test[\"default_profile\"].cast(BooleanType()))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Second Transformation\n",
    "df_test = df_test.withColumn('default_profile',df_test['default_profile'].cast(IntegerType()))\n",
    "df_test = df_test.withColumn('name',when(df_test['name'].isNull(),0).otherwise(1))\n",
    "df_test = df_test.withColumn('verified',df_test['verified'].cast(IntegerType()))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Theird Transformation\n",
    "df_test = df_test.withColumn('verified',when(df_test['verified'].isNull(),0).otherwise(df_test['verified']))\n",
    "df_test = df_test.withColumn('default_profile',when(df_test['default_profile'].isNull(),0).otherwise(df_test['default_profile']))\n",
    "df_test = df_test.withColumn('location',when(df_test['location'].isNull(),0).otherwise(1))\n",
    "df_test = df_test.withColumn('status',when(df_test['status'].isNull(),0).otherwise(1))\n",
    "df_test = df_test.withColumn('screen_name',when(df_test['screen_name'].isNull(),0).otherwise(1))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forth Transformation\n",
    "df_test = df_test.dropna(subset=['description'])\n",
    "\n",
    "def split_and_set(some_str):\n",
    "    if isinstance(some_str, str):\n",
    "        some_str = ''.join(c for c in some_str if c not in \"[](){}<>,'/.\")\n",
    "        return list(set(some_str.split(' ')))\n",
    "    return some_str\n",
    "\n",
    "list_udf = udf(lambda y: split_and_set(y), ArrayType(StringType()))\n",
    "df_test = df_test.withColumn('description', list_udf(df_test['description']))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "\n",
    "# Fifth Transformation - fill NA:\n",
    "df_test = df_test.fillna({'followers_count':0,'statuses_count':0,'favourites_count':0,'listed_count':0,'friends_count':0,})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save to parquet:\n",
    "\n",
    "Code sample:\n",
    "```python\n",
    "df_test.write.parquet(\"test_data\")\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `test_data` file that you save doesn't consist of information about bots at all.\n",
    "\n",
    "We can use it to compare various algorithms and see how they behave.\n",
    "However, since our training data is supervised, we would like to test it with classified data.\n",
    "This will help us estimate our model.\n",
    "\n",
    "Hence, you will split the training data into testing and train data set.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the train data:\n",
    "df_train = spark.read.parquet(\"final_train_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the training data into training and test sets, hold 30% out for testing.\n",
    "\n",
    "Use randomSplit function:\n",
    "\n",
    "```python\n",
    "(trainingData, testData) = some_data.randomSplit((0.7, 0.3))\n",
    "```\n",
    "\n",
    "<details><summary>Hint</summary>\n",
    "<p>\n",
    "\n",
    "Use randomSplit function:\n",
    "    \n",
    "```python\n",
    "(trainingData, testData) = data.randomSplit((0.7, 0.3))\n",
    "\n",
    "```  \n",
    "</p>\n",
    "</details>\n",
    "\n",
    "Remember to validate yourself with count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the split data for the next Chapter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "testData.write.mode('overwrite').parquet(\"classified_test_data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainingData.write.mode('overwrite').parquet(\"classified_train_data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Well Done! 👏👏👏\n",
    "\n",
    "\n",
    "## You just finished:  Apache Spark ML - Create train and test set \n",
    "\n",
    "\n",
    "## Next exercise: Apache Spark ML and create machine learning models"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
